Entering the ‘New Frontier’ of Mathematics Assessment: Designing 
and Trialling the PVAT-O (online) 


Angela Rogers 

RMIT University 
<angela.rogers@rmit.edu.au> 


As we move into the 21st century, educationalists are exploring the myriad of possibilities 
associated with Computer Based Assessment (CBA). At first glance this mode of 
assessment seems to provide many exciting opportunities in the mathematics domain, yet 
one must question the validity of CBA and whether our school systems, students and 
teachers are ready to harness this form of assessment. The most obvious advantages of 
CBA are the speed and accuracy of accessing results and the opportunities for innovative 
item development. This paper will aim to highlight how several factors can obstruct the 
validity and reliability of this assessment mode, particularly at an item level. These threats 
to validity must be carefully considered by test designers to ensure CBA is used effectively 
in primary school mathematics classrooms. 


Throughout my seven years teaching in Victorian Catholic Primary Schools, I was 
constantly dismayed by the level of superficial whole number place value understanding 
displayed by Year 3-6 students. Similar difficulties were found in research focussing on 
students’ place value understanding beyond two digits (Major, 2011; Thomas, 2004). The 
problems students exhibited in Year 3-6 seemed to be related to a lack of quality 
assessment instruments available to assist teachers to accurately assess this “critical area of 
mathematics” (Major, 2011, p. 82). 

Without access to quality place value assessments, Year 3-6 teachers were being 
impeded in their attempts to improve place value teaching and learning. As such, my 
research has centred around developing a comprehensive whole number Place Value 
Assessment Tool (PVAT) for Year 3-6 students (Rogers, 2012). 

Throughout the course of this research, another reality of classroom teaching and 
assessment has become apparent: the time factor (Ketterlin-Geller, 2009). During the 
PVAT paper and pen (P&P) trials, teachers expressed their pleasure at the insights the test 
provided, yet they were concerned about the time taken to correct and collate the test data. 
This prompted an investigation into creating an online version of the PVAT. 

This paper will describe the process of designing and trialling the PVAT-O (online), 
report on its comparability with the P&P version of the PVAT and discuss the implications 
of using this test in a current school setting. 

Literature 

Whole number place value is a very difficult concept for students to grasp (Rogers, 
2012). The complexities associated with the acquisition of two-digit place value 
knowledge have been well documented (Baroody, 1990; Ellemor-Collins & Wright, 2009). 
Yet research by Thomas (2004) and Major (2011) suggests that the difficulties students 
encounter comprehending and applying the recursive multiplicative structure of the 
number system beyond two digits are just as widespread. 

Assessment, by its very nature, is designed to measure a particular attribute (Morgan, 
2000). The primary purpose of this ‘measurement’ is the opportunity it provides for 
teachers to gain a better understanding of their students’ level of this attribute and thus 
hopefully improve the effectiveness of their teaching (Morgan, 2000). A comprehensive 
place value assessment which addressed the seven aspects of place value (count, 
make/represent, name/record, rename, compare/order, calculate and estimate) (Rogers, 
2012) was required to provide teachers with the detailed information they need to conduct 
targeted teaching at each student’s point of need (Vale, Weaven, Davies, & Hooley, 2010). 

Traditionally mathematics assessment has been delivered via paper and pencil (P&P) 
means (Griffin, McGaw, & Care, 2012). However, as we move further into the 21st 
century, computer based assessment (CBA) provides exciting opportunities for the 
advancement of the mathematics evaluative process. Just as computers provide many 
avenues for teachers to trial new ways of teaching, CBA provides opportunities for test 
developers to explore a multitude of possibilities. This, coupled with the recognition that 
“doing mathematics with the assistance of a computer is now part of mathematical 
literacy” (Stacey, 2012, p. 11), has led many, including the developers of large scale tests 
such as PISA and NAPLAN, to move towards investigating the potential of CBA (Tout & 
Spithill, 2012, December). 

CBA can be utilized in several ways in a mathematics assessment context. These 
include facilitating the design of assessments which better address existing constructs 
(Csapo, Ainley, Bennett, Latour, & Law, 2012), those which address totally new constructs 
(Stacey & William, 2013) and those which deliver traditional assessment in a more 
efficient and effective manner (Bridgeman, 2009). Within each of these categories there 
are also different features of the CBA platform which can be utilized, including fixed and 
adaptive testing or Internet and other delivery systems (Stacey & William, 2013). While 
each avenue has its own challenges and validity issues, all provide opportunities for 
mathematics assessment which have previously been impossible. 

Much research associated with CBA has addressed the comparison of a traditional P&P 
based test with its CBA equivalent (Bennett et al., 2008; Poggio, Glasnapp, Yang, & 
Poggio, 2004; Thomson & Weiss, 2009; Wang, Jiao, Young, Brooks, & Olson, 2007). 
Wang et al. (2007) conducted a meta-analysis of 44 mathematics based assessments which 
compared P&P and CBA versions of the same test, and reported that overall the mode of 
administration did not have a statistically significant effect on the tests. This supported the 
work of Poggio et al. (2004) who reported that “there existed no meaningful statistical 
differences” (p. 30) between the two modes in their research. However, Poggio et al. 
(2004) did discover that, at an item level, there were more substantive differences. 

Item level functioning differences were also explored by Bennett et al. (2008). Their 
study used two randomly parallel groups of students and found the CBA to be statistically 
significantly more difficult than the P&P test. The results from this study led Csapo et al. 
(2012) to warn that until further studies with alternate research designs, such as that used 
by Bennett et al. (2008), are conducted, the view that P&P is comparable to CBA should, 
at best, be “viewed as preliminary” (p. 184). 

The reasons for the differences in student performance at an item level between the 
P&P and CBA mode are varied and can be difficult to pinpoint. Csapo et al. (2012) suggest 
that factors such as the quality of graphics available on a CBA platform have been found to 
affect the way a student interacts with items. Computer graphics are considered to provide 
“richer stimulus material” (Csapo et al., 2012, p. 153), which affects students’ 
engagement (Stacey, 2012) and potentially their ability to answer the item. Thus it is 
critical that researchers closely analyse differences in item functioning to 
ensure accurate comparisons are made between the two modes. 

Not only does CBA provide opportunities for traditional P&P tests to be converted into 
more efficient computer based forms, it has many applications in the design and 
presentation of innovative new ways of assessing mathematics. As Stacey (2012) points 
out, CBA items can be “more interactive, authentic and engaging” (p. 11). The use of item 
formats such as ‘drop and drag’, ‘radio buttons’ and the possibility of using “dynamic 
stimuli” (Csapo et al., 2012, p. 149) like audio, video or animation allow greater scope for 
targeting aspects of mathematical constructs that have never been tested before. Yet, as 
Csapo et al. (2012) suggest, CBA also brings forth many challenging validity issues. 

The work of Lowrie and Diezmann (2009), although focusing on P&P assessment, 
questions whether assessment items which include graphical representations are measuring 
students’ ability to decode and interpret the graphics rather than assessing their content 
knowledge. These questions become particularly pertinent in the CBA domain, where the 
use of dynamic stimuli such as images, video or audio input, could potentially change the 
skill or content that is intended to be assessed (assuming coping with this stimulus is not 
the intended outcome of the assessment). This phenomenon was noted in the PISA 2006 
computer based assessment of science (CBAS) trial, where differences in item scores were 
not a result of the mode of delivery but of a feature that was associated with the delivery 
mode (Csapo et al., 2012). This poses significant challenges for CBA test developers. 

While in P&P mode administration errors such as misprinted or missing pages can cause 
validity issues, research by Bridgeman, Lennon and Jackenthal (2003) reported that 
variations such as screen size, screen resolution and display rate can all influence the way 
students experience CBA. Thompson and Weiss (2009) explain how Internet-based 
assessments have validity issues associated with the bandwidth, the browser used to access 
the sites and the general capabilities of the school computer facilities. Clearly these 
external factors are not within the control of most test developers and thus are difficult to 
address, making the pursuit of a valid and reliable CBA even more challenging. 

Methodology 


Background to research 

The first phase of this research involved the construction of a Hypothetical Learning 
Trajectory (HLT) (Simon, 1995) addressing the seven aspects of place value (Rogers, 
2012). Assessment items were designed to target a range of difficulties within each of 
these aspects and these formed the basis of the PVAT P&P test. This test was piloted at 
two primary schools (A and B) in metropolitan Melbourne. The test was a 45 minute P&P 
test suitable for Year 3-6 students and included a total of 78 short answer items which 
teachers considered time consuming to score. The theoretical paradigm underpinning this 
research is informed by Cobb’s (1996) emergent perspective that acknowledges the social 
and psychological elements at play in the construction of meaning. 

PVAT-O trial 

The trial for the PVAT-O was conducted at a Catholic Primary school (School C) in 
metropolitan Melbourne. The school had approximately 253 Year 3-6 students across nine 
classes, which were all involved in the trial. The trial was undertaken using a 
counterbalanced measures design (Shuttleworth, 2009). This research design required half 
the students in each class (randomly selected) to complete the PVAT-O, and then exactly 
one week later complete the PVAT. Concurrently the other half of each class completed 
the PVAT followed by the PVAT-O one week later. The counterbalanced research design 
was used to minimise the risk of factors such as learning effects and order of treatment 
adversely influencing the results of the trial (Shuttleworth, 2009). 
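
The allocation step of a counterbalanced design like the one described above can be sketched in a few lines of Python. This is an illustration only: the paper does not report how the random selection was carried out, and the function name, class labels, student identifiers and fixed seed below are all hypothetical.

```python
import random


def counterbalance(class_lists, seed=0):
    """Randomly split each class in half and assign the two test orders."""
    rng = random.Random(seed)  # fixed seed so the allocation can be reproduced
    schedule = {}
    for class_name, students in class_lists.items():
        shuffled = list(students)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for student in shuffled[:half]:
            # Online version first, paper version one week later
            schedule[student] = ("PVAT-O", "PVAT")
        for student in shuffled[half:]:
            # Paper version first, online version one week later
            schedule[student] = ("PVAT", "PVAT-O")
    return schedule


# Hypothetical class lists, not data from the trial
classes = {
    "3A": ["s01", "s02", "s03", "s04", "s05", "s06"],
    "4B": ["s07", "s08", "s09", "s10", "s11"],
}
schedule = counterbalance(classes)
for student in sorted(schedule):
    print(student, schedule[student])
```

Splitting within each class, rather than across the whole cohort, keeps the two test orders balanced across year levels and teachers. When a class has an odd number of students, this sketch places the extra student in the paper-first group.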

Both test forms were identical in their mathematical content. The selected items had 
previously been validated using Rasch analysis in their P&P form (during School A and B 
trial) and covered a range of item difficulties and aspects of place value (Rogers, 2012). 

The items and images used in each mode were “identical”. However, as the items were 
originally written in the P&P form, some needed to be slightly altered for the CBA 
platform. Figure 1 shows how “Question 46” in the PVAT-O required students to click the 
dots to colour them, while this item in the PVAT (Figure 2) required traditional colouring 
skills. This item was designed to address the “count” (Rogers, 2012) aspect of place value. 


[Figures 1 and 2: Question 46 of the Place Value Assessment Tool (PVAT) in each mode. 
Both versions show a picture with the text “Look at this picture. Here are 100 dots”. The 
PVAT-O instruction reads “Colour 10 000 dots in red by clicking on them”; the PVAT 
instruction reads “Colour 10 000 dots in red”.] 

Figure 1. PVAT-O (Count Item). Figure 2. PVAT (Count Item). 

Another major difference between the two modes was the inclusion of an “audio assist” 
button on each PVAT-O item (Figure 1). Each student was provided with earphones during 
the PVAT-O test and could choose to click on the “audio assist” button to hear the text 
being read. In order to gain an indication of how frequently this feature was used throughout 
the trial, students were asked to record the number of times they employed this assistance. 
This feature allowed PVAT-O students to choose to read the item themselves or listen to 
the text being read. In the PVAT, students could only read the item themselves. 

The time taken for students to complete both the PVAT and the PVAT-O was recorded 
in order to determine the duration of each mode. After the students had completed both 
tests they were asked to complete a short survey indicating the version of the test they 
preferred and reasons associated with their selection. All students completing both versions 
of the test were supervised by the researcher. 

The data from the paper version were coded and scored by the researcher using the 
same criteria as the online version. Each PVAT-O item response was double checked by 
the researcher to confirm the online database was consistently scoring the student 
responses, ensuring valid data for the analysis. 

Importantly, while 253 students were involved in the research, due to several 
unavoidable circumstances including student absence, technological issues and challenges 
in matching the birthdate students entered on the PVAT-O with that on the PVAT, there 
was a total of 227 students (M=45%, F=55%) who were identified as completing both 
forms of the test. The analysis was restricted to these students. 

Rasch Analysis 

A Rasch analysis (Adams & Khoo, 1996) was conducted on the data collected from the 
trial to address the question of whether the PVAT and the PVAT-O could be considered 
substantively different in their difficulty. Rasch is a probabilistic model that measures item 
difficulty and student achievement on the same logit scale (Siemon, Breed, Dole, Izard, & 
Virgona, 2006). Three Rasch analyses (Run A, B and C) were used to determine the mean 
of the item difficulties from each mode. The first analysis (Run A) looked at student 
responses to the PVAT items. The results from Run A created an anchor file composed of 
the items which were considered to be valid and reliable in this mode of administration. 
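
To make the idea of a shared logit scale concrete, the following minimal Python sketch evaluates the standard dichotomous Rasch model, in which the probability of a correct response depends only on the difference between student ability and item difficulty. It assumes right/wrong scoring; the paper does not state which variant of the model was fitted, and the function name and logit values here are illustrative rather than estimates from the trial.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model.

    Both arguments are in logits, so only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Illustrative logit values only
print(round(rasch_probability(0.0, 0.0), 2))   # ability equals difficulty -> 0.5
print(round(rasch_probability(1.0, 0.0), 2))   # student one logit above the item -> 0.73
print(round(rasch_probability(-1.0, 0.0), 2))  # student one logit below the item -> 0.27
```

Because items and students sit on the same scale, an item whose estimated difficulty shifts between the paper and online modes can be read directly as a change in the chance that a student of given ability answers it correctly.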

The next analysis (Run B) looked at student responses to the PVAT-O items. The items 
which were considered valid and reliable from this analysis were then used in Run C. The 
final analysis (Run C) combined the student responses to each item on both the PVAT and 
the PVAT-O. The anchor file from Run A was used in this analysis as it allowed the item 
difficulty estimates of the surviving PVAT items to be fixed, so that the PVAT-O items 
could then be calibrated against them (Izard, 2005). The purpose of Run C was to 
investigate whether item difficulties varied by administration mode when anchoring was used. 

The item difficulties for each mode were collated and the mean for the PVAT and 
PVAT-O was calculated from Run C. These results were compared using effect size 
measures as this would provide a simple way of quantifying the difference between the 
item difficulties and student achievement on both tests. Run C of the Rasch analysis also 
allowed for the mean ability of the students who completed both tests to be compared. 

Results 

Table 1 summarises the findings of the Rasch analysis (Run C) which looked at the 
comparison of item difficulties in the PVAT and PVAT-O. 


Table 1 

Effect Size Estimates for Items by PVAT Administration Mode (Anchored Run) 

                           PVAT items (N=46)    PVAT-O items (N=62) 
Mean                       0.34                 0.37 
Mean Difference                       0.03 
Standard Dev.              2.16                 1.86 
Pooled Std. Dev.                      1.99 
Effect Size (Std error)               0.02 (0.19) 
Descriptor                            Very Small (Izard, 2004, March) 



The effect size measure for the comparison of the PVAT and the PVAT-O 
was calculated to be 0.02. This is described as a “very small (0.00-0.14)” (p. 8) 
magnitude of effect size (Izard, 2004, March). This suggests that in this study there was not 
a substantive difference between the two modes of administration. 

Table 2 summarises the findings from the Rasch analysis (Run C) which looked at the 
difference in estimates of student achievement between the PVAT and PVAT-O. 

Table 2 

Effect Size Estimates for Students by PVAT Test Administration Mode 

                           PVAT (N=227)         PVAT-O (N=227) 
Mean                       0.61                 0.63 
Mean Difference                       0.02 
Standard Dev.              0.23                 0.19 
Pooled Std. Dev.                      0.21 
Effect Size (Std error)               0.09 (0.09) 
Descriptor                            Very small (Izard, 2004, March) 



The effect size measure for the comparison of the ability of students 
completing the PVAT and PVAT-O was calculated to be 0.09. This is described as a 
“very small” (p. 8) magnitude of effect size (Izard, 2004, March). This suggests there are 
no substantive differences between the students’ achievement in each mode. 
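
The figures in Tables 1 and 2 can be approximately reproduced from the published means, standard deviations and sample sizes. The Python sketch below assumes the conventional standardised mean difference (the difference in means divided by the pooled standard deviation) and a common large-sample approximation for its standard error. The paper does not state which formulas were used, so this is a plausibility check under that assumption rather than the author's procedure. The function names are hypothetical.

```python
import math


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two groups."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def effect_size(mean1, sd1, n1, mean2, sd2, n2):
    """Return (pooled SD, standardised mean difference, approximate standard error)."""
    sp = pooled_sd(sd1, n1, sd2, n2)
    d = (mean2 - mean1) / sp
    # Large-sample approximation to the standard error of d (an assumption here)
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return sp, d, se


# Inputs are the rounded values published in Tables 1 and 2
items = effect_size(0.34, 2.16, 46, 0.37, 1.86, 62)
students = effect_size(0.61, 0.23, 227, 0.63, 0.19, 227)

print("Items:    pooled SD %.2f, effect size %.2f (SE %.2f)" % items)
print("Students: pooled SD %.2f, effect size %.2f (SE %.2f)" % students)
# Items:    pooled SD 1.99, effect size 0.02 (SE 0.19)
# Students: pooled SD 0.21, effect size 0.09 (SE 0.09)
```

Under these assumptions the rounded outputs agree with the tabulated values. In both comparisons the standard error is at least as large as the effect size itself, which is consistent with the conclusion that no substantive difference between modes was detected.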

Discussion 

It must be noted that for this type of analysis, the trial involved a relatively small 
sample, both in the number of students and the number of items. This limits the scope of 
conclusions that can be made from the research, particularly at an item level. The trial 
should be considered merely a population of items and students, not a representative 
sample. However, with this in mind, it is interesting to note that, consistent with other 
research in this area (Poggio et al., 2004; Thomson & Weiss, 2009), at an item level there 
are several cases where substantive differences emerge across the two modes. 

The differences noted at item level highlight the importance of investigating the 
features of items that may be influencing the way students are approaching them, 
particularly in the CBA mode. The use of graphics and other features such as ‘drop and 
drag’ are factors which provide great opportunities for innovative item development but 
also should be considered to be possible threats to the validity of the item. 

The difficulty with CBA is accurately pinpointing the factors which are affecting the 
way students approach items. For example, the data collected in this study suggest that 
42% of students reported using the audio assist button on an average of 3.79 items. This 
may or may not have affected the way these students answered such items. 

Another important aspect associated with CBA is the affective side of this mode of 
delivery. Csapo et al. (2012) note that the level of proficiency and the general familiarity 
students have with computers can affect their level of interest and approach to CBA. The 
student surveys in this study suggest that 55% of students preferred completing the 
PVAT-O test, stating reasons such as “it’s easier to see the graphics”, “I like using computers 
more” and “you can listen to the question if you get stuck”. Those who preferred the 
paper version cited reasons like “it takes longer on the computer”, “you can do more 
working out when you have it on paper in front of you” and “the computer is frustrating”. 
It should be noted that the PVAT-O took students on average 37 minutes while the PVAT 
took an average of 32 minutes. This discrepancy was mostly due to the speed of the 
school’s internet capabilities. Some computers seemed to take a great deal longer than 
others to move through the PVAT-O, no doubt frustrating the students working on them. 

It became apparent that, from a systems perspective, the capability of the school’s 
computer and technological infrastructure is of immense importance when implementing a 
CBA platform. In order to successfully facilitate CBA within a school, there are many 
considerations that need to be made. These include addressing logistical issues such as 
sourcing enough computers and locating appropriate spaces to administer the test, and 
practical considerations such as making available assistance to deal efficiently with 
technological issues which commonly arise in this mode of testing. All of these variables 
influence the success and validity of the CBA testing process. Clearly this testing mode 
requires commitment from the school, teachers and students to be a success. 

Conclusion 

As our society continues to advance technologically, mathematics educators and test 
designers alike must carefully consider the future of traditional paper and pen assessments 
in light of the ever evolving computer based assessment platform. The advantages claimed 
for CBA include immediate results and the opportunities for innovative and creative item 
development. However, there are also many issues surrounding the validity and reliability 
of CBA, particularly at an item level. The results of this research paper suggest that, like 
previous research in this area (Poggio et al., 2004; Wang et al., 2007), the PVAT-O and 
PVAT appear to be very similar in overall scores of item difficulty. However, at an item 
level there are several items which appear to display substantively different difficulty 
thresholds, suggesting there may be features of the CBA platform which alter the construct 
being measured. This is an area of the PVAT-O research which is currently being 
investigated further. Furthermore, the capabilities and resources of schools that choose to 
embrace this mode of assessment are all variables that are difficult to control and pose 
significant threats to the validity and success of CBA. It seems Lowrie’s (2012, November) 
suggestion to “hasten slowly” when entering the new frontier of CBA is sound advice. 